{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Stochastic Gradient Descent\n",
    "\n",
    "[$\\theta_{j} := \\theta_{j} - \\alpha \\cdot (\\hat{y}^{i} - y^{i}) \\cdot x_{j}^{i}$  for i in range (m)]\n",
    "    \n",
    "Basically, in SGD, we are using the cost gradient of $\\textbf{1 example}$ at each iteration, instead of using the sum of the cost gradient of $\\textbf{ALL}$ examples\n",
    "\n",
    "Used the cost function of Linear Regression. Also, I formatted the above pseudocode using list compression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def SGD(f,theta0,alpha,num_iters):\n",
    "    \"\"\" Arguments:\n",
    "        f -- the function to optimize, it takes a single argument and yields two outputs,\n",
    "             a cost and the gradient with respect to the arguments\n",
    "        theta0 -- the initial point to start SGD from\n",
    "        num_iters -- total iterations to run SGD for\n",
    "        \n",
    "        Return:\n",
    "        theta -- the parameter value after SGD finishes\n",
    "    \"\"\"\n",
    "    start_iter = 0\n",
    "    theta = theta0\n",
    "    \n",
    "    for iter in range(start_iter + 1, num_iter +1):\n",
    "        _, grad = f(theta)\n",
    "        theta = theta - (alpha * grad) # Note: there is NO dot product!\n",
    "    \n",
    "    return theta "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Few Things to Note:\n",
    "\n",
    "1) In SGD, before for-looping, you need to randomly shuffle the training examples.\n",
    "\n",
    "2) In SGD, because it's using only one example at a time, its path to the minima is noiser (more random) than that of the batch gradient. But it's ok as we are indifferent to the path, as long as it gives us the minimum AND shorter training time. \n",
    "\n",
    "3) Mini-batch gradient descent uses $\\textbf{n}$ data points (instead of $\\textbf{1}$ sample in SGD) at each iteration."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
